[CPU][float8] Add scaled_embedding_bag kernel #2686
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2686
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: There is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures: As of commit df09264 with merge base 9056c46.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/quantization/test_quant_api.py (outdated)

```python
    "CPU" not in torch._C._dispatch_dump("torchao::qembeddingbag"),
    reason="cpp kernels not built",
)
def test_embeddingbag_cpu(self):
```
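For orientation, a minimal sketch of the full guard this fragment belongs to; the surrounding test class and `unittest.skipIf` call are reconstructed here, not taken from the diff, and the class name is hypothetical:

```python
import unittest

import torch


class TestQuantAPI(unittest.TestCase):  # hypothetical class name
    @unittest.skipIf(
        # skip when the C++ kernel is not registered for the CPU dispatch key
        "CPU" not in torch._C._dispatch_dump("torchao::qembeddingbag"),
        reason="cpp kernels not built",
    )
    def test_embeddingbag_cpu(self):
        ...
```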
the test should be added here I think: https://github.com/pytorch/ao/blob/main/test/test_ops.py
@pytorchbot label "topic: new feature"
LGTM. Have you run any benchmarks to make sure it's not too slow?
@jerryzh168 Could you help review this PR?
torchao/ops.py (outdated)

```diff
@@ -70,6 +70,9 @@
 lib.define(
     "da8w4_linear_cpu(Tensor input, Tensor input_scales, Tensor input_qzeros, Tensor weight, Tensor weight_scales, Tensor weight_qzeros, Tensor compensation, Tensor? bias, ScalarType output_dtype) -> Tensor"
 )
+lib.define(
+    "qembeddingbag(Tensor qweight, Tensor indices, Tensor offsets, Tensor weight_scale, float o_scale, int mode, bool include_last_offset) -> Tensor"
+)
```
is this the same as https://github.com/pytorch/pytorch/blob/371eacb2ae4ecdabc52ea4634ed21558df2f3bab/aten/src/ATen/native/native_functions.yaml#L2368C1-L2369C1, with the only difference being that qweight is float8?
@jerryzh168 Thanks for reviewing. Yes, I think so, except that the implementation in this PR has limited functionality so far.
This operator is used for inference, so I did not add any parameters related to the gradient, including scale_grad_by_freq, sparse, per_sample_weights, and padding_idx.
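For comparison, the full-precision op carries those extra arguments. The signature of `torch.nn.functional.embedding_bag` is sketched below from memory, so treat the defaults as approximate:

```python
# torch.nn.functional.embedding_bag (approximate signature, for comparison
# with the reduced inference-only schema above)
def embedding_bag(input, weight, offsets=None, max_norm=None, norm_type=2,
                  scale_grad_by_freq=False, mode="mean", sparse=False,
                  per_sample_weights=None, include_last_offset=False,
                  padding_idx=None):
    ...
```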
I think we should add this to PyTorch directly if that's the case. float8 is a native dtype in PyTorch, so it makes the most sense to add the functionality there; we can error out in the op if some arg combination is not supported or invalid for float8.
Intel platforms have fp8 instructions. When we are ready, we hope to update this kernel to use them. As far as I know, the latest GCC is required. Would that be difficult to support in PyTorch?
So, since this PR adds a quantized version of this op, do you think it is better to add it in torchao rather than in torch core? Thanks.
yeah, my question is whether this can be implemented by extending the embedding_bag op in PyTorch and doing the scaling in torchao, or would performance be a concern here?
This is a memory-bound operator, so repeated reads and writes lead to significant performance degradation. For example, if we originally need a single read/write pass (and DLRM hits this op many times), doing the scaling separately requires two passes, cutting performance roughly in half.
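To make the trade-off concrete, here is a minimal sketch of the unfused two-pass alternative being discussed; the broadcasting of `weight_scale` and the exact semantics of `o_scale` are assumptions, not details confirmed in this thread:

```python
import torch
import torch.nn.functional as F

def scaled_embedding_bag_unfused(qweight, indices, offsets, weight_scale, o_scale):
    # pass 1: dequantize the whole fp8 table (a full extra read + write)
    # weight_scale assumed already broadcastable (e.g. a scalar, or per-row
    # scales reshaped to [-1, 1])
    weight = qweight.to(torch.float32) * weight_scale
    # pass 2: the actual bag lookup and sum reduction
    out = F.embedding_bag(indices, weight, offsets, mode="sum",
                          include_last_offset=True)
    return out * o_scale  # output scaling; multiplication is an assumption
```

The fused kernel in this PR avoids the first full pass over the table, which is where the roughly 2x saving described above comes from.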
OK, sounds good. Maybe rename this to _scaled_embedding_bag to follow these ops: https://github.com/pytorch/pytorch/blob/31a41daff49f2cde941d8b9e35cb2eaeeb606c0d/aten/src/ATen/native/native_functions.yaml#L7135. The leading underscore indicates it's a prototype op, since you may want to update the arg list, expand hardware coverage, etc. later.
Done
test/test_ops.py (outdated)

```python
        mode_enum,
        include_last_offset,
    ).to(dtype)
    torch.testing.assert_close(refe_out, test_out, atol=0, rtol=0)
```
is this too strict?
changed to 1e-5
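Presumably the relaxed assertion looks something like the line below; `assert_close` requires `rtol` and `atol` to be set together, so the `rtol` value is an assumption:

```python
torch.testing.assert_close(refe_out, test_out, atol=1e-5, rtol=1e-5)
```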
@pytorchbot merge
Merge failed. Reason: 1 mandatory check(s) are pending/not yet run.
Dig deeper by viewing the pending checks on hud
we just merge manually with the button in torchao
also, is this op built by default? I think ideally it should be optional so it does not impact the normal build; we have seen errors where kernels from prototype features broke the torchao build
Like the other kernels under cpu/*.cpp, it is not built by default; it is built only with USE_CPU_KERNELS=1.
Introduced recently in #2686
Implemented FP8 QEmbeddingBag on CPU, currently supporting:
- include_last_offset=True
- mode="sum"
Next steps